Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: avro-sdc #18

Merged
merged 3 commits into from
Jul 12, 2023
Merged

feat: avro-sdc #18

merged 3 commits into from
Jul 12, 2023

Conversation

sebastianswms
Copy link
Collaborator

Feature additions:

  • Avro file parsing
    • Multiple schema coercion strategies.
  • _sdc columns
    • Information on each record's file name and line number.
    • Enable/disable with additional_info config.
  • Header and footer configuration
    • header_skip and footer_skip to ignore the beginning and end of delimited files.
    • override_headers to allow headerless delimited files and/or renaming of existing headers.

Closes #5
Closes #11
Closes #12

Additional info in _sdc columns
header_skip, footer_skip, override_headers
@sebastianswms sebastianswms marked this pull request as ready for review June 23, 2023 20:32
@sebastianswms sebastianswms requested a review from visch June 23, 2023 20:32
@sebastianswms sebastianswms marked this pull request as draft June 23, 2023 20:51
@sebastianswms sebastianswms removed the request for review from visch June 23, 2023 20:51
README.md Outdated
| quote_character | False | " | The character used to indicate when a record in a CSV contains a delimiter character. |
| header_skip | False | 0 | The number of initial rows to skip at the beginning of each delimited file. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be just for delimited files right? same for delimited, header_skip, footer_skip, and override_headers

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should only be for delimited files. I've prepended delimited_ to each name to make this more clear.

| delimiter | False | detect | The character used to separate records in a CSV/TSV. Can be any character or the special value `detect`. If a character is provided, all CSV and TSV files will use that value. `detect` will use `,` for CSV files and `\t` for TSV files. |
| file_type | False | delimited | Can be any of `delimited`, `jsonl`, or `avro`. Indicates the type of file to sync, where `delimited` is for CSV/TSV files and similar. Note that *all* files will be read as that type, regardless of file extension. To only read from files with a matching file extension, appropriately configure `file_regex`. |
| compression | False | detect | The encoding to use to decompress data. One of `zip`, `bz2`, `gzip`, `lzma`, `xz`, `none`, or `detect`. If set to `none` or any encoding, that setting will be applied to *all* files, regardless of file extension. If set to `detect`, encodings will be applied based on file extension. |
| additional_info | False | 1 | If `True`, each row in tap's output will have two additional columns: `_sdc_file_name` and `_sdc_line_number`. If `False`, these columns will not be present. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaulting to True I think makes more sense here. Does it apply for all file types though? Like avro I'm not sure it does 🤷‍♂️ we'd have to force some kind of implementation for each stream type maybe some can just return n/a or something if it doesn't make sense.

Copy link
Collaborator Author

@sebastianswms sebastianswms Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It already does default to True. required=False, default=1. That's just how --about format=markdown does it.

I think it applies to all file types, or at least all file types we have so far. For avro it just increments _sdc_line_number for each record in a file and resets the count for each new file.

header_skip = self.config["header_skip"]
footer_skip = self.config["footer_skip"]

for reader_dict in self._get_readers():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that the header and footer are being first ran through DictReader right? I think this would cause issues for most of the use cases for this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why. I've updated it to process delimited_header_skip and delimited_footer_skip before parsing with DictReader.

Refactor header_skip and footer_skip to process skipping before parsing.
@sebastianswms sebastianswms marked this pull request as ready for review June 27, 2023 17:04
@sebastianswms sebastianswms requested a review from visch June 27, 2023 17:04
@visch
Copy link
Member

visch commented Jul 12, 2023

Let's get this merged and conficts fixed, and then fix Pytest etc. Looks good to me

@visch visch merged commit e083c0d into MeltanoLabs:main Jul 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants